Variance Reduction Methods for Sublinear Reinforcement Learning
Authors
Abstract
This work considers the problem of provably optimal reinforcement learning for episodic finite-horizon MDPs, i.e., how an agent learns to maximize its long-term reward in an uncertain environment. The main contribution is a novel algorithm, Variance-reduced Upper Confidence Q-learning (vUCQ), which enjoys a regret bound of Õ(√(HSAT) + HSA), where T is the number of time steps the agent acts in the MDP, S is the number of states, A is the number of actions, and H is the episodic horizon. This is the first regret bound that is both sub-linear in the model size and asymptotically optimal. The algorithm is sub-linear in that the time to achieve ε-average regret (for any constant ε) is O(SA), which is far fewer samples than are required to learn any non-trivial estimate of the transition model (the transition model is specified by O(S²A) parameters). The importance of sub-linear algorithms is largely the motivation for Q-learning and other "model-free" approaches. vUCQ also enjoys minimax-optimal regret in the long run, matching the Ω(√(HSAT)) lower bound. vUCQ is a successive-refinement method: the algorithm reduces the variance of its Q-value estimates and couples this estimation scheme with an upper-confidence-based exploration strategy. Technically, it is the coupling of these two techniques that yields both the sub-linear regret property and the asymptotically optimal regret.
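To make the upper-confidence ingredient concrete, here is a minimal Python sketch of optimistic tabular Q-learning for an episodic MDP. This is an illustration, not the authors' vUCQ: the bonus form, the learning rate, the constant c, and the random MDP are all assumptions, and vUCQ's variance-reduction step (re-estimating Q-values against a slowly updated reference value function) is deliberately omitted.

    # Sketch of upper-confidence tabular Q-learning (illustrative; not vUCQ itself).
    import numpy as np

    rng = np.random.default_rng(0)
    S, A, H, K = 5, 3, 4, 2000                       # states, actions, horizon, episodes
    P = rng.dirichlet(np.ones(S), size=(S, A))       # random transition model (assumed)
    R = rng.uniform(size=(S, A))                     # random rewards in [0, 1] (assumed)

    Q = np.full((H, S, A), float(H))                 # optimistic initialization
    N = np.zeros((H, S, A))                          # visit counts
    c = 1.0                                          # bonus scale (assumed hyperparameter)

    for k in range(K):
        s = 0                                        # fixed initial state (assumed)
        for h in range(H):
            a = int(np.argmax(Q[h, s]))              # act greedily w.r.t. optimistic Q
            s_next = rng.choice(S, p=P[s, a])
            N[h, s, a] += 1
            t = N[h, s, a]
            lr = (H + 1) / (H + t)                   # learning rate used in UCB-Q analyses
            bonus = c * np.sqrt(H**3 * np.log(S * A * H * K) / t)   # Hoeffding-style bonus
            v_next = 0.0 if h == H - 1 else min(H, Q[h + 1, s_next].max())
            target = R[s, a] + v_next + bonus
            Q[h, s, a] = (1 - lr) * Q[h, s, a] + lr * target
            s = s_next

Plain Q-learning would set the bonus to zero; the optimism term is what drives systematic exploration. vUCQ's contribution, as the abstract describes it, is to shrink the noise in the update target so that the confidence intervals, and hence the regret, can be made tighter.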
Similar resources
Deep Reinforcement Learning
In reinforcement learning (RL), stochastic environments can make learning a policy difficult due to high degrees of variance. As such, variance reduction methods have been investigated in other works, such as advantage estimation and control-variates estimation. Here, we propose to learn a separate reward estimator to train the value function, to help reduce variance caused by a noisy reward sig...
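Since the snippet cites control-variates estimation, a minimal sketch of that idea may help: a score-function (REINFORCE-style) gradient estimator keeps the same mean but loses variance when a baseline is subtracted. The one-parameter Bernoulli policy and the baseline values below are illustrative assumptions, not the cited paper's reward-estimator method.

    # Control-variate (baseline) variance reduction in a score-function estimator.
    import numpy as np

    rng = np.random.default_rng(1)
    theta = 0.3                                      # Bernoulli policy: P(a=1) = theta

    def grad_estimates(baseline, n=100_000):
        a = (rng.uniform(size=n) < theta).astype(float)
        r = a                                        # toy reward: 1 if a == 1, else 0
        score = a / theta - (1 - a) / (1 - theta)    # d/dtheta log pi(a)
        return score * (r - baseline)                # baseline leaves the mean unchanged

    for b in (0.0, 0.5):                             # b = 0: plain REINFORCE; b > 0: control variate
        g = grad_estimates(b)
        print(f"baseline={b}: mean={g.mean():.3f}, var={g.var():.3f}")

Both runs estimate the same gradient (mean ≈ 1 here), but the baselined estimator has markedly lower variance, which is the entire point of the technique.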
Safe Policy Search for Lifelong Reinforcement Learning with Sublinear Regret
Lifelong reinforcement learning provides a promising framework for developing versatile agents that can accumulate knowledge over a lifetime of experience and rapidly learn new tasks by building upon prior knowledge. However, current lifelong learning methods exhibit non-vanishing regret as the amount of experience increases, and include limitations that can lead to suboptimal or unsafe control...
Cold-Start Reinforcement Learning with Softmax Policy Gradient
Policy-gradient approaches to reinforcement learning have two common and undesirable overhead procedures, namely warm-start training and sample variance reduction. In this paper, we describe a reinforcement learning method based on a softmax value function that requires neither of these procedures. Our method combines the advantages of policy-gradient methods with the efficiency and simplicity ...
Stochastic Variance Reduction for Policy Gradient Estimation
Recent advances in policy gradient methods and deep learning have demonstrated their applicability for complex reinforcement learning problems. However, the variance of the performance gradient estimates obtained from the simulation is often excessive, leading to poor sample efficiency. In this paper, we apply the stochastic variance reduced gradient descent (SVRG) technique [1] to model-free p...
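For reference, here is a minimal sketch of the SVRG estimator [1] itself, shown on a simple least-squares finite sum rather than the model-free policy-gradient setting of the snippet. The data, step size, and epoch length are assumptions; the cited work's adaptation to policy gradients additionally needs importance weighting because the sampling distribution changes with the policy.

    # SVRG: stochastic gradient steps corrected by a periodically refreshed full gradient.
    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 200, 5
    X, y = rng.normal(size=(n, d)), rng.normal(size=n)

    def grad_i(w, i):
        return (X[i] @ w - y[i]) * X[i]              # per-example least-squares gradient

    w, lr = np.zeros(d), 0.01
    for epoch in range(20):
        w_snap = w.copy()
        mu = (X.T @ (X @ w_snap - y)) / n            # full gradient at the snapshot
        for _ in range(n):                           # inner loop of variance-reduced steps
            i = rng.integers(n)
            g = grad_i(w, i) - grad_i(w_snap, i) + mu    # unbiased; low variance near w_snap
            w -= lr * g
    print("relative residual:", np.linalg.norm(X @ w - y) / np.linalg.norm(y))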
Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines
Policy gradient methods have enjoyed great success in deep reinforcement learning but suffer from high variance of gradient estimates. The high variance problem is particularly exacerbated in problems with long horizons or high-dimensional action spaces. To mitigate this issue, we derive a bias-free action-dependent baseline for variance reduction which fully exploits the structural form of the...
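The identity that makes such baselines bias-free can be stated in one line. For a policy that factorizes across m action dimensions, π_θ(a|s) = Π_i π_θ(a_i|s), a per-dimension baseline b_i depending on the state and the other dimensions a_{−i} leaves the gradient unbiased (notation assumed here, not copied from the paper):

    \nabla_\theta J(\theta) = \mathbb{E}\Big[\sum_{i=1}^{m} \nabla_\theta \log \pi_\theta(a_i \mid s)\,\big(\hat{Q}(s,a) - b_i(s, a_{-i})\big)\Big]

Because a_i is sampled independently of a_{−i} given s, each baseline term has zero mean: E_{a_i}[∇_θ log π_θ(a_i|s)] = 0, so b_i(s, a_{−i}) factors out and contributes nothing to the expectation, however it is chosen.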
Journal: CoRR
Volume: abs/1802.09184
Pages: -
Publication year: 2018